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Abstract 

Our paper deals with inferring simulator-based statistical models given some observed data. 

A simulator-based model is a parametrized mechanism which specifies how data are gener¬ 
ated. It is thus also referred to as generative model. We assume that only a finite number 
of parameters are of interest and allow the generative process to be very general; it may be 
a noisy nonlinear dynamical system with an unrestricted number of hidden variables. This 
weak assumption is useful for devising realistic models but it renders statistical inference 
very difficult. The main challenge is the intractability of the likelihood function. Several 
likelihood-free inference methods have been proposed which share the basic idea of iden¬ 
tifying the parameters by finding values for which the discrepancy between simulated and 
observed data is small. A major obstacle to using these methods is their computational 
cost. The cost is largely due to the need to repeatedly simulate data sets and the lack 
of knowledge about how the parameters affect the discrepancy. We propose a strategy 
which combines probabilistic modeling of the discrepancy with optimization to facilitate 
likelihood-free inference. The strategy is implemented using Bayesian optimization and is 
shown to accelerate the inference through a reduction in the number of required simulations 
by several orders of magnitude. 

Keywords: intractable likelihood, latent variables, Bayesian inference, approximate Bayesian 
computation, computational efficiency 


1. Introduction 

We consider the statistical inference of a finite number of parameters of interest 0 G 
of a simulator-based statistical model for observed data yo which consist of n possibly 
dependent data points. A simulator-based statistical model is a parametrized stochastic 
data generating mechanism. Formally, it is a family of probability density functions (pdfs) 
{Py\e}9 of unknown analytical form which allow for exact sampling of data yg ~ Py\e- In 
practical terms, it is a computer program which takes a value of 9 and a state of the random 
number generator as input and returns data yg as output. Simulator-based models are also 
called implicit models because the pdf of yg is not specified explicitly (Diggle and Gratton, 
1984), or generative models because they specify how data are generated. 
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Simulator-based models are useful because they interface easily with models typically 
encountered in the natural sciences. In particular, hypotheses of how the observed data 
Yo were generated can be implemented without making excessive compromises in order to 
have an analytically tractable model pdf Py\e- 

Since the analytical form of is unknown, inference using the likelihood function 

£( 0 ), 

^0) = Py\e{yo\0), (1) 

is not possible. The likelihood function is also not available for a large class of other statis¬ 
tical models which are known as unnormalized models. In these models, Py^g is only known 
up to a normalizing scaling factor (the partition function) which guarantees that Py^g is a 
valid pdf for all values of 0. Simulator-based models differ from unnormalized models in 
that not only is the scaling factor unknown but also the shape of Py\g- Likelihood-free in¬ 
ference methods developed for unnormalized models (for example Hinton, 2002; Hyvarinen, 
2005; Pihlaja et ah, 2010; Gutmann and Hirayama, 2011; Gutmann and Hyvarinen, 2012) 
are thus not applicable to simulator-based models. 

For simulator-based models, likelihood-free inference methods have emerged in multiple 
disciplines. “Indirect inference” originated in economics (Gourieroux et ah, 1993), “approx¬ 
imate Bayesian computation” (ABG) in genetics (Beaumont et ah, 2002; Marjoram et ah, 
2003; Sisson et ah, 2007), or the “synthetic likelihood” approach in ecology (Wood, 2010), 
for an overview, see, for example, the review by Hartig et ah (2011). The different meth¬ 
ods share the basic idea to identify the model parameters by finding values which yield 
simulated data that resemble the observed data. 

The generality of simulator-based models comes with the expense of two major diffi¬ 
culties in the inference. One difficulty is the assessment of the discrepancy between the 
observed and simulated data (Joyce and Marjoram, 2008; Wegmann et ah, 2009; Nunes 
and Balding, 2010; Fearnhead and Prangle, 2012; Aeschbacher et ah, 2012; Gutmann et ah, 
2014). The other difficulty is that the inference methods tend to be slow due to the need to 
simulate a large collection of data sets and due to the lack of knowledge about the relation 
between the model parameters and the corresponding discrepancies. 

In this paper, we address the computational difficulty of the likelihood-free inference 
methods. We propose a strategy which combines probabilistic modeling of the discrepan¬ 
cies with optimization to facilitate likelihood-free inference. The strategy is implemented 
using Bayesian optimization (see, for example, Brochu et ah, 2010). We show that us¬ 
ing Bayesian optimization in likelihood-free inference (BOLFI) can reduce the number of 
required simulations by several orders of magnitude, which accelerates the inference sub¬ 
stantially. ^ 

The rest of the paper is organized as follows: In Section 2, we present examples of 
simulator-based statistical models to help clarify their properties. In Section 3, we provide 
a unified review of existing inference methods for simulator-based models, and use the 
examples to point out computational issues. The computational difficulties are summarized 
in Section 4, and a framework to address them is outlined in Section 5. Section 6 implements 

1. Preliminary results were presented at “Approximate Bayesian Computation in Rome”, 2013, and MCMCSki IV, 
2014, as a poster “Bayesian optimization for efficient likelihood-free inference”, and at the NIPS workshop “ABC in 
Montreal”, 2014, as part of an oral presentation. 
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the framework using Bayesian optimization. Applications of the developed methodology are 
given in Section 7, and Section 8 concludes the paper. 

2. Examples of Simulator-Based Statistical Models 

We present here three examples of simulator-based statistical models. The first example is 
an artificial one, but useful because it allows us to illustrate the central concepts. The other 
two are examples from real data analysis with intractable models (Wood, 2010; Numminen 
et ah, 2013). The examples will be used throughout the paper and the model details can 
be looked up here when needed. 

Example 1 (Normal distribution). A standard way to sample data = {yg^\ ■ ■ ■ 
from a normal distribution with mean 9 and variance one is to sample n standard normal 
random variables u = and to add 6 to the obtained samples, 

ye = e + u, u;~AA(0,I„). (2) 

The symbol AA(0, I„) denotes a n-variate normal distribution with mean zero and identity 
covariance matrix. After sampling of the random quantities cj, the observed data yo are a 
deterministic transformation of ui and the parameter 6. For more general simulators, the 
same principle applies. In particular, the data ye are a deterministic transformation of 6 if 
the random quantities are kept fixed, for example by fixing the seed of the random number 
generator. ▲ 

Example 2 (Ricker model). In this example, the simulator consists of a latent stochastic 
time series and an observation model. The latent time series is a stochastic version of the 
Ricker map which is a classical model in ecology (Ricker, 1954). The stochastic version can 
be described as a nonlinear autoregressive model, 

logAiW =logr + logA^(*-^)+cjeW, t = l,...,n, N® = 0, (3) 

where is the size of some animal population at time t and the are independent 
standard normal random variables. The latent time series has two parameters: logr which 
is related to the log growth rate and a for the standard deviation of the innovations. A 
Poisson observation model is assumed, such that given y^^ is drawn from a Poisson 

distribution with mean 

yg'^\N^^\ip ~ Poisson((/?N*^*^), (4) 

where is a scaling parameter. The model is thus in total parametrized hy 0 = (log r, a, p). 
Figure 1(a) shows example data generated from the model. Inference of 9 is difficult because 
the are not directly observed and because of the strong nonlinearity in the autoregres¬ 
sive model. Wood (2010) used this example to illustrate his “synthetic likelihood” approach 
to inference. ▲ 

Example 3 (Bacterial infections in day care centers). The data generating process is here 
defined via a latent continuous-time Markov chain and an observation model. The model 
was developed by Numminen et al. (2013) to infer the transmission dynamics of bacterial 
infections in day care centers. 
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(a) Ricker model (b) Bacterial infections in day care centers 

Figure 1: Examples of simulator-based statistical models, (a) Data generated from the Ricker model 
in Example 2 with n = 50 and 9o = (log Tq, cTo, </?o) = (3.8, 0.3,10). (b) Data generated 
from the model in Example 3 on bacterial infections in day care centers. There are 33 
different strains of the bacterium in circulation and Mdcc = 36 of the 53 attendees of 
the day care center were sampled (Numminen et al., 2013). Each black square indicates 
a sampled attendee who is infected with a particular strain. The data were generated with 
6»o = (/3o,Ao,0o) = (3.6,0.6,0.1). 




The variables of the latent Markov chain are the binary indicator variables which 
specify whether attendee i of a day care center is infected with the bacterial strain s at time 
t {ijg = 1), or not = 0). Starting with zero infected individuals, = 0 for all i and s, 
the states evolve in a stochastic manner according to the rate equations 

P(4+" = 0|4 = l) = /r + o(/i), (5) 

P(4+'* = l\li, = 0 Vs') = Rs{t)h + o{h), (6) 

P(4+^ = 1|4 = 0, 3s' : 4, = 1) = eRs{t)h + o{h), (7) 

where /i is a small time interval and Rs{t) the rate of infection with strain s at time t. The 
three equations model the probability to clear a strain s during time t and t + h (Equation 

5) , the probability to be infected with a strain s if not colonized by other strains (Equation 

6) , and the probability to be infected if colonized with other strains (Equation 7). The 
rate of infection is a weighted combination of the probability Pg for an infection happening 
outside the day care center and the probability Es{t) for an infection from within, 

Rsif) = j3Es{t) + KPs- ( 8 ) 

We refer the reader to the original publication by Numminen et al. (2013) for more de¬ 
tails and the expression for Es{t). The observation model was random sampling of Mdcc 
individuals without replacement from all the individuals attending a day care center at 
some sufficiently large random time (endemic situation). The model has three parameters 
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6 = (/3,A,6): the internal infection parameter /3, the external infection parameter A, and 
the co-infection parameter 6. Figure 1(b) shows an example of data generated from the 
model. 

Numminen et ah (2013) applied the model to data on colonizations with the bacterium 
Streptococcus pneumoniae. The observed data yo were the states of the sampled attendees of 
29 day care centers, that is, 29 binary matrices as in Figure 1(b) but with varying numbers 
of sampled attendees per day care center. Inference of the parameters is difficult because 
the data are a snapshot of the state of some of the attendees at a single time point only. 
Since the process evolves in continuous-time, the modeled system involves infinitely many 
correlated unobserved variables. ▲ 

3. Inference Methods for Simulator-Based Statistical Models 

This section organizes the foundations and the previous work. We first point out proper¬ 
ties common to all inference methods for simulator-based models, one being the general 
manner of constructing approximate likelihood functions. We then explain parametric and 
nonparametric approximations of the likelihood and discuss the relation between the two 
approaches. This is followed by a summary of currently used posterior inference schemes. 

3.1 General Properties of the Different Inference Methods 

Inference of simulator-based statistical models is generally based on some measurement of 
discrepancy Ag between the observed data yo and data yg simulated with parameter value 
6. The discrepancy is used to define an approximation L{6) of the likelihood C{6). The 
approximation happens on multiple levels. 

On a statistical level, the approximation consists of reducing the observed data yo to 
some features, or summary statistics <ho before performing inference. The purpose of the 
summary statistics is to reduce the dimensionality and to filter out information which is not 
deemed relevant for the inference of 6. That is, in this first approximation, the likelihood 
C[0) is replaced with T(0), 

L(0) =p$|g(^>o|0), (9) 

where p$|g is the pdf of the summary statistics. The function L{6) is a valid likelihood 
function, but for the inference of 6 given <ho, and not for the inference of 6 given y^, in 
contrast to C(0), unless the chosen summary statistics happened to be sufficient in the 
standard statistical sense. 

The likelihood function L(6), however, is also not known, because the pdf p$|g is of 
unknown analytical form, which is a property inherited from Py\e- Thus, L{6) needs to be 
approximated by some method. We denote practical approximations obtained with finite 
computational resources by L{6). Limiting approximations if infinitely many computational 
resources were available will be denoted by L{0). 

In the paper, we will encounter several methods to construct L{6). They all base the 
approximation on simulated summary statistics ‘hg, generated with parameter value 0. The 
simulation of summary statistics is generally done by simulating a data set yg, followed by its 
reduction to summary statistics. Table 1 provides an overview of the different “likelihoods” 
appearing in the paper. 
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Symbol 

Meaning 

Definition 

C 

true likelihood based on observed data 

Eq(l) 

L 

true likelihood based on summary statisties 

Eq (9) 

L 

approximation of L requiring infinite eomputing power 

See 3.1 

Lsih) 

parametrie approx/synthetie (log) likelihood 

Eq (13) 

Lk, 

nonparametrie approx with kernel n 

Eq (22) 

Lu 

nonparametrie approx with uniform kernel 

Eq (25) 

L 

eomputable approximation of L 

See 3.1 


parametrie approx/synthetie (log) likelihood with sample averages 

Eq (15) 


nonparametrie approx with kernel n and sample averages 

Eq (21) 


nonparametrie approx with uniform kernel and sample averages 

Eq (24) 


parametrie approx/synthetie (log) likelihood with regression 

See 5.2 

f (^) 

nonparametrie approx with kernel n and regression 

See 5.3 

J (^) 

nonparametrie approx with uniform kernel and regression 

See 5.3 


Table 1: The main (approximate) likelihood funetions appearing in the paper. The superseript “N” 
indieates that the sample average is eomputed using N simulated data sets per model 
parameter 6. The superseript “{t)” indieates that regression is performed with a training 
set eontaining t simulated data sets. The parametrie approximations will be used together 
with the Gaussian and the Rieker model, the nonparametrie approximations together with 
the Gaussian and the day eare eenter model. 


After construction of L, inference can be performed in the usual manner by replacing C 
with L. Approximate posterior inference can be performed via Markov chain Monte Carlo 
(MCMC) algorithms or via an importance sampling approach (see, for example, Robert and 
Casella, 2004). The posterior expectation of a function g{6) given yo can be computed via 
importance sampling with auxiliary pdf q{0), 


M 

E{g{e)\yo) ^ Y. 

m=l 


W 


(m) 






2^2 = 1 


£(6»W) 


P0(e^^'>) ’ 


6 »(™) q{e), ( 10 ) 


where pg denotes the prior pdf. This approach also yields an estimate of the posterior 
distribution via the “particles” 0^™) and the associated weights A computable version 

is obtained by replacing C with L, giving E{g{d)\yo) ~ E(5((0)|<ho), 


M 

m=l 


W 


(m) 






2 ^ 2=1 


L(0W) 


g(6»(»)) 


0M g(0). (11) 
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There is some flexibility in the choice of the auxiliary pdf q{6) in Equations (10) and (11) 
which enables iterative adaptive algorithms where the accepted 0 ^™) of one iteration are 
used to define the auxiliary distribution q{G) of the next iteration (population or sequential 
Monte Carlo algorithms, Cappe et ah, 2004; Del Moral et ah, 2006). 

3.2 Parametric Approximation of the Likelihood 

The pdf p $|0 of the summary statistics is of unknown analytical form but it may be reason¬ 
ably assumed that it belongs to a certain parametric family. For instance, if is obtained 
via averaging, the central limit theorem suggests that the pdf may be well approximated 
by a Gaussian distribution if the number of samples n is sufficiently large, 

(2T)-/^|detS«|V2 ■('A - /^«)) , (12) 

where p is the dimension of The corresponding likelihood function is Lg = exp(.^ 5 ), 

4(0) = log(27r) - ^ log I det - ^(4>o - ^(4>o - (13) 

which is an approximation of L{0) unless the summary statistics are indeed Gaussian. The 
mean fig and the covariance matrix are generally not known. But the simulator can be 
used to estimate them via a sample average over N independently generated summary 
statistics. 


p,g = E^ [CD,] = 


1 


N 

2=1 


4> 


(2) i.i.d. 

e ~ 


P^\e, 


£0 — E 


N 


(4>0 - A0)(^e - P'0 


,T 


(14) 


A computable estimate of the likelihood function L{6) is then given by = exp(£^), 

isiO) = log(27r) - ^ log I det fig)'^t-\^, - fig). (15) 

This approximation was named synthetic likelihood (Wood, 2010), hence our subscript “s”. 

Due to the approximation of the expectation with a sample average, is a stochastic 
process (a random function). We illustrate this in Example 4 below. We there also show 
that the number of simulated summary statistics (data sets) Ai is a trade-off parameter: The 
computational cost decreases as N decreases but the variability of the estimate increases 
as a consequence. It further turns out that the sample curves of may not be smooth for 
finite N and that decreasing N may worsen their roughness. We illustrate this in Example 
5 using the Ricker model. 

Example 4 (Synthetic likelihood for the mean of a normal distribution). The sample 
average is a sufficient statistic for the task of inferring the mean 6 from a sample yo = 
{y^\... of a normal distribution with assumed variance one. We thus reduce the 

observed and simulated data yo and ye to the empirical means <ho and $ 0 , respectively. 
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(a) Negative log synthetic likelihood for N = 2 


(b) Distribution of the max synthetic likelihood estimate 


Figure 2: Estimation of the mean of a Gaussian, (a) The figure shows the negative log synthetic 
likelihood It illustrates that is a random function, (b) The randomness makes the 
estimate 6 = argmaxg {6) a random variable. Its variability increases as N decreases. 


In this special case, no information is lost with the reduction to the summary statistic, that 
is, L{0) oc C{9). Furthermore, the distribution of the summary statistic is here known, 
M{9, 1/n) so that the Gaussian model assumption holds and Ls{9) = L{9). 

Using for simplicity the true variance of <he, we have {9) = —1/2 log(27r/n) —n/2(<ho — 
Since fie is an average of N realizations of fie ~ A^(0, l/{nN)), and we can write 
as a quadratic function subject to a random shift g, 

Each realization of g yields a different mapping 9 ^ which illustrates that the (log) 
synthetic likelihood is a random function. Figure 2(a) shows the 0.1 and 0.9 quantiles of 
for N = 2. The dashed curve visualizes 9 e-)• for a fixed realization of g. The 
circles show values of —£^{9) when g is not kept fixed as 9 changes. The results are for 
sample size n = 10. 

The optimizer 9 of each realization of £^ depends on 0 = <l>o — g. That is, 0 is a 
random variable with distribution AA(<l>o, l/(A^n)). In the limit of an infinite amount of 
available computational resources, that is —)• oo, g equals zero, and the distribution has 
a point-mass at 0mie = which is indicated with the black vertical line in Figure 2(b). 
As N decreases, variance is added to the point-estimate 9. This added variability is due 
to the use of finite computational resources; it does not reflect uncertainty about 9 due to 
the finite sample size n. The variability causes an inflation of the mean squared estimation 
error by a factor of (1 -f 1/A^), E((0 — 0o)^) = l/u(l -f 1/N). ▲ 

Example 5 (Synthetic likelihood for the Ricker model). Wood (2010) used the synthetic 
likelihood to perform inference of the Ricker model and other simulator-based models with 
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(a) Negative log synthetic likelihood for different N (b) Zoom for N = 500; 5,000; 50,000 

Figure 3: Using less computational resources may reduce the smoothness of the approximate likeli¬ 
hood function. The figures show the negative log synthetic likelihood for the Ricker 
model. Only the first parameter (log r) was varied, the others were kept fixed af fhe dafa 
generafing values, (a) The use of simulafions makes fhe synfhefic likelihood a sfochasfic 
process. Realizations of —If for differenf N are shown fogefher wifh fhe variabilify for 
N = 50. (b) The curves become more and more smoofh as fhe number N of simulafed 
dafa sefs increases even fhough fhe curve for N = 50,000 is still rugged. If is reasonable 
fo assume fhough fhaf fhe limif foiN^oo is smoofh. 


complex dynamics. Time series data ye = ■ ■ ■, from the Ricker model after 

some “burn-in” time Tf, were summarized in the form of the coefficients of the autocorrelation 
function and the coefficients of fitted nonlinear autoregressive models, thereby reducing the 
data to fourteen summary statistics (see the supplementary material of Wood, 2010, for 
their exact definition). 

Figure 3 shows the negative log synthetic likelihood —if for the Ricker model as a 
function of the log growth rate logr for yo in Figure 1(a). The parameters a and (p were 
kept fixed at the values Uo = 0.3 and po = 10 which we used to generate yo (log r° was 3.8). 
The figures show that the realizations of the synthetic likelihood become less smooth as N 
decreases. 

The lack of smoothness makes the minimization of the different realizations of —if 
difficult. A grid-search is feasible for very large N but this approach does not scale to 
higher dimensions. Gradient-based optimization is tricky because the functional form of 
if is unknown. Finite differences may not yield a reliable approximation of the gradient 
because of the lack of smoothness. Instead of optimizing a single realization of the objective, 
one could use an approximate stochastic gradient approach. That is, approximate gradients 
are computed with different random seeds at different values of the parameter. For small 
N, however, the gradients are unreliable so that the stepsize has to be very small, which 
makes the optimization rather costly again. To resolve the issue, we suggest a more efficient 
approach by combining probabilistic modeling with optimization. ▲ 
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3.3 Nonparametric Approximation of the Likelihood 

An alternative to assuming a parametric model for the pdf p<j,|g of the summary statistics is 
to approximate it by a kernel density estimate (Rosenblatt, 1956; Parzen, 1962; Mack and 
Rosenblatt, 1979; Wand and Jones, 1995), 

(18) 

where K is a suitable kernel and denotes empirical expectation as before, 

1 ^ 

E^[K{(f>,<!>e)] = -^i^(</>,ci>W), cj>W (19) 

2=1 

An approximation of the likelihood function L{0) is given by L^{6), 

L^{e)=E^[K{^o,^0)]. (20) 

We may re-write iP(<ho,<h) in another form as k.{Aq) where > 0 depends on <i>o and 
^ 0 , and «; is a univariate non-negative function not depending on 6. The kernels K are 
generally such that k, has a maximum at zero (the maximum may be not unique though). 
Taking the empirical expectation in Equation (20) with respect to Ag instead of ^g, we 
have L^{e) = L^{e), 


l;7{9) = e"(«(A„)|. 

( 21 ) 

As the number N grows, converges to L^, 


L^{e)=E[K{Ag)], 

( 22 ) 


which is where the empirical average E^ is replaced by the expectation E. The lim¬ 
iting approximate likelihood L^{6) does not not necessarily equal the likelihood L{6) = 
P^q{^o\0). For example, if K.{Ag) is obtained from a translation invariant kernel K, that 
is, K{Ag) = K(^o — ^e), Lk is the likelihood for a summary statistics whose pdf is obtained 
by convolving p^^g with K. 

For convex functions k, Jensen’s inequality yields a lower bound for and its loga¬ 
rithm, 

{0) > K (J^{0)) , log L^(0) > log , J^{e) = E^ [As] . (23) 

Since k is maximal at zero, the lower bound is maximized by minimizing the conditional 
empirical expectation J^{6). The advantage of the lower bound is that it can be maximized 
irrespective of k, which is often difficult to choose in practice. 

A popular choice of k for likelihood-free inference is the uniform kernel k = Ku which 
yields the approximate likelihood , 

Ku{u) = cXiQ^h){u), {9) = cP^ {Ag < h), (24) 

where the indicator function X[g /j)(n) equals one if u G [0,/i) and zero otherwise. The 
scaling parameter c does not depend on 6, and the positive scalar h is the bandwidth of 
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the kernel and acts as acceptance/rejection threshold. The approximate likelihood is 
proportional to the empirical probability that the discrepancy is below the threshold. The 
limiting approximate likelihood is denoted by L„(0), 

iu{e)=cV{^e<h). (25) 

The lower bound for convex k is not applicable but we can obtain an equivalent bound by 
Markov’s inequality, 

L^{e) = c[l-P^{A:^>h)]>c l-^E^lAe] . (26) 

The lower bound of the approximate likelihood can be maximized by minimizing J^{6) as 
for convex k. 

We illustrate the approximation of the likelihood via in Example 6 below. It is 
pointed out that good approximations are computationally very expensive because of the 
very small probability for Aq to be below small thresholds /i, or, in other words, because of 
the large rejection probability. We then use the model for bacterial infections in day care 
centers to show in Example 7 that the minimizer of J^{0) can provide a good approximation 
of the maximizer of (6). This is important because does not require choosing the 
bandwidth h or involve any rejections. 

Example 6 (Approximate likelihood for the mean of a Gaussian). Eor the inference of 
the mean of a Gaussian, we can use as discrepancy the squared difference between the 
empirical mean of the observed and simulated data yo and yo, that is the squared difference 
between the two summary statistics ‘ho and in Example 4: = ($o ~ ^e)^- Because of 

the use of simulated data, like the synthetic likelihood, the discrepancy Aq is a stochastic 
process. We visualize its distribution in Eigure 4(a). The observed data yo were the same 
as in Example 4. 

Eor this simple example, we can compute the limiting approximate likelihood in 
Equation (25) in closed form, 

Lu{0) oc F — 0) + Vnh^ — F — 0) — Vnh^ , (27) 

where F(x) is the cumulative distribution function (cdf) of a standard normal random 
variable. 



Eor small nh, Lu{6) becomes proportional to the likelihood L(9). This is visualized in 
Eigure 4(b).^ However, the probability to actually observe a realization of Ag which is 
below the threshold h becomes vanishingly small. Eor realistic models, is not available 
in closed form but needs to be estimated. The vanishingly small probability indicates that 
the inference procedure will be computationally expensive when is estimated via the 
sample average approximation L^. ▲ 

Example 7 (Approximate univariate likelihoods for the day care centers). In the model 
for bacterial infections in day care centers, the observed data were converted to summary 

2. Using h = Q.l for illustrative purposes. For threshold choice in real applications, see Example 7 and Section 5.3. 
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(a) Distribution of the discrepancy 


(b) Approximate likelihood 


Figure 4: Nonparametric approximation of the likelihood to estimate the mean 0 of a Gaussian. The 
discrepancy Ag is the squared difference between the empirical means of the observed 
and simulated data, (a) The discrepancy is a random function, (b) The probability that the 
discrepancy is below some threshold h approximates the likelihood. Note the different 
range of the axes. 


statistics 4>o by representing each day care center (binary matrix) with four statistics. This 
gives 4 • 29 = 116 summary statistics in total (see Numminen et al., 2013, for details). 

Since the day care centers can be considered to be independent, the 29 observations can 
be used to estimate the distribution of the four statistics and their cdfs. Numminen et al. 
(2013) assessed the difference between and by the Li distance between the estimated 
cdfs. Each Li distance had its own uniform kernel and corresponding bandwidth, which 
means that a product kernel was used overall. We here work with a simplified discrepancy 
measure: The different scales of the four statistics were normalized by letting the maximal 
value of each of the four statistics be one for yo. The discrepancy was then the Li norm 
between and <ho divided by their dimension, Ag = 1 /I 16 ||<h 0 — 4>o||i. 

Figure 5 shows the distributions of the discrepancies Ag if one of the three parameters 
is varied at a time. The results are for the real data used by Numminen et al. (2013). 
The parameters were varied on a grid around the (rounded) mean (3.6, 0.6,0.1) which was 
inferred by Numminen et al. (2013). The distributions were estimated using N = 300 
realizations of Ag per parameter value. The red solid lines show the empirical average 

. The black lines with circles show with bandwidths (thresholds) equal to the 0.1 
quantile of the sampled discrepancies. While subjective, this is a customary choice (Marin 
et al., 2012). The thresholds were hp = 1.16, h\ = 1.18, and hg = 1.20, and are marked 
with green lines. It can be seen that the optima of and are attained at about 
the same parameter values which is advantageous because is independent of kernel and 
bandwidth. 

Since the functional form of and its gradients are, however, not known, the mini¬ 
mization becomes a difficult problem in higher dimensions. We will show that the idea of 
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(a) Internal infection parameter /3 


(b) External infection parameter A 


(c) Co-infection parameter 9 


Figure 5: Approximate likelihoods and distributions of the discrepancy for the day care 
center example. The green horizontal lines indicate the thresholds used. The optima 
of the average discrepancies and the approximate likelihoods occur at about the same 
parameter values. 


combining probabilistic modeling with optimization, which we mentioned in Example 5 for 
the log synthetic likelihood, is also helpful here. A 


3.4 Relation between Nonparametric and Parametric Approximation 


Kernel density estimation with Gaussian kernels is interesting for two reasons in the context 
of likelihood-free inference. First, the Gaussian kernel is positive definite, so that the 
estimated density is a member of a reproducing kernel Hilbert space. This means that more 
robust approximations of than the one in Equation (18) would exist (Kim and Scott, 
2012), and that there might be connections to the inference approach of Fukumizu et al. 
(2013). Second, it allows us to embed the synthetic likelihood approach of Section 3.2 into 
the nonparametric approach of Section 3.3. 

For the Gaussian kernel, we have that K{^o, ^e) = Kg{^o — 


Kg{<^o - ‘he) = 


1 


1 


(27r)P/'^ I det 


exp 




(29) 


where Cg is a positive definite bandwidth matrix possibly depending on 6. The kernel Kg 
corresponds to k = Kg and Ag = A^, 


= (27r^)p/2 i~T) ’ = log|detC0 |-h(d>o-d>0 )’^C0 ^(d>o-^>0). (30) 

The function Kg is convex so that Equation (23) yields a lower bound for L^{6) = (0) 

and its logarithm. 


logL^(0) > -|log(27r) - ^Jf(0), 

TN/'n\ _ tt'A^ 


J^^(0) = E^'' log I det Cel + (<!>« - cbe)^Cg - ‘he) 
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We used the subscript “g” to highlight that in Equation (23) is computed for the 
particular discrepancy Ag. The form of is reminiscent of the log synthetic likelihood 
in Equation (15). The following proposition shows that there is indeed a connection. 

Proposition 1 (Synthetic likelihood as lower bound). For Cq = 


^TW = |-|log(27r)-^jf(0), 

(33) 

logLf(0 )>-|+hf(0) 

(34) 


The proposition is proved in Appendix A. It shows that maximizing the synthetic log- 
likelihood corresponds to maximizing a lower bound of a nonparametric approximation 
of the log likelihood. The proposition embeds the parametric approach to likelihood ap¬ 
proximation conceptually in the nonparametric one and shows furthermore that can be 
computed via an empirical expectation over A^. 


3.5 Posterior Inference using Sample Average Approximations of the Likelihood 

Several computable approximations L of the likelihood L were constructed in the previous 
two sections. Table 1 provides an overview. Intractable expectations were replaced with 
sample averages using N simulated data sets which we denoted by the superscript “A” in 
the symbols for the approximations. 

Wood (2010) used the synthetic likelihood together with a Metropolis MCMC al¬ 
gorithm for posterior computations. We here focus on posterior inference via importance 
sampling. Using as L in Equation (11), we have 


M 

e(5(0)|^o)^ 

m=l 


W. 


(m) 




(jm) ^ 

0 ) q(0(™)) 


l^i=l 


Ef=iX[0,,)(A^^ 


(A) 


g(0«) 


(35) 


where ^[o^h) is the indicator function of the interval [0,/i) as before, and the j = 

1,..., A, are the observed discrepancies for the sampled parameter ~ q{^)- Instead of 
sampling several discrepancies for the same sampling M' pairs (Ag\ 0^*^) with A = 1 
is also possible and corresponds to an asymptotically equivalent solution. With q = pe, the 
approximation is a Nadaraya-Watson kernel estimate of the conditional expectation (see, 
for example, Wasserman, 2004, Chapter 21). 

Approximate Bayesian computation (ABC) is intrinsically linked to kernel density es¬ 
timation and kernel regression (Blum, 2010). A basic ABC rejection sampler (Pritchard 
et ah, 1999; Marin et ah, 2012, Algorithm 2) is obtained from Equation (35) with A = 1, 
q = Pe, and Ag = ||<ho — 4>0|| where ||.|| is some norm. Approximate samples from the 
posterior pdf of 6 given 4>o can thus be obtained by retaining those for which the 
are within distance h from 4>o. In an iterative approach, the accepted particles can be used 
to define the auxiliary pdf q{9) of the next iteration by letting it be a mixture of Gaussians 
with weights center points 6^'^\ and a covariance determined by the (Beaumont 

et ah, 2009). This gives the population Monte Carlo (PMC) ABC algorithm (Marin et ah, 
2012 , Algorithm 4). Related sequential Monte Carlo (SMC) ABC algorithms were proposed 
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by Sisson et al. (2007) and Toni et al. (2009). Working with q = pe, Beaumont et al. (2002) 
introduced ABC with more general kernels, which corresponds to using instead of . 

Example 6 showed that approximating the likelihood via sample averages is computa¬ 
tionally expensive because of the required small thresholds. The auxiliary pdf q{6) specifies 
where in the parameter space the likelihood is predominantly evaluated. The following ex¬ 
ample shows that avoiding regions in the parameter space where the likelihood is vanishingly 
small allows for considerable computational savings. 

Example 8 (Univariate approximate posteriors for the day care centers). For the inference 
of the model of bacterial infections in day care centers, Numminen et al. (2013) used uniform 
priors for the parameters G (0,11), A G (0,2), and 0 G (0,1). The likelihoods shown 
in Figure 5 are thus proportional to the posterior pdfs. The posterior pdfs of the univariate 
unknowns are conditional on the remaining fixed parameters. For example, the posterior 
pdf for j3 is conditional on (A,0) = {Ao,9o) = (0.6,0.1). In Section 7, we consider inference 
of all three parameters at the same time. 

In Figure 5, each parameter is evaluated on a sub-interval of the domain of the prior. The 
sub-intervals were chosen such that the far tails of the likelihoods were excluded. Parameter 
/3, for example, was evaluated on the interval (1.5,5.5) only. Evaluating the discrepancy 
Ag on the complete interval (0,11) is not very meaningful since the probability that it is 
above the chosen threshold is vanishingly small outside the interval (1.5, 5.5). In fact, out 
of M = 5,000 discrepancies which we simulated for f3 uniformly on (0,11), not a single 
one was accepted for /3 ^ (1.5,5.5). Hence, taking for instance a uniform distribution on 
(1.5,5.5) instead of the prior as auxiliary distribution leads to considerable computational 
savings. Motivated by this, we propose a method which automatically avoids regions in the 
parameter space where the likelihood is vanishingly small. A 

4. Computational Difficulties in the Standard Inference Approach 

We have seen that the approximate likelihood functions L{6) which are used to infer 
simulator-based statistical models are stochastic processes indexed by the model param¬ 
eters 0. Their properties, in particular their functional form and gradients, are generally 
not known; they behave like stochastic black-box functions. The stochasticity is due to 
the use of simulations to approximate intractable expectations. In the standard approach 
presented in the previous section, the expectations are approximated by sample averages so 
that a single evaluation of L requires the simulation of N data sets. The standard approach 
makes minimal assumptions but suffers from a couple of limiting factors. 

1. There is an inherent trade-off between computational and statistical efficiency: Re¬ 
ducing N reduces the computational cost of the inference methods, but it can also 
decrease the accuracy of the estimates (Figure 2). 

2. For finite N, the approximate likelihoods may not be smooth (Figure 3). 

3. Simulating N data sets uniformly in the parameter space is an inefficient use of compu¬ 
tational resources and particularly costly if simulating a single data set already takes 
a long time. In some regions in the parameter space, far fewer simulations suffice to 
conclude that it is very unlikely for the approximate likelihood to take a significant 
value (Figures 2 to 5). 
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5. Framework to Increase the Computationally Efficiency 

We present a framework which combines optimization with probabilistic modeling in order 
to increase the efficiency of likelihood-free inference of simulator-based statistical models. 

5.1 From Sample Average to Regression Based Approximations 

The standard approach to obtain a computable approximate likelihood function L relies 
on sample averages, yielding the parametric approximation = exp(.^^) in Equation 
(15) or the nonparametric approximation in Equation (21). The approximations are 
computable versions of Lg = exp(£ 5 ) in Equation (13) and in Equation (22), which both 
involve intractable expectations. But sample averages are not the only way to approximate 
the intractable expectations. We here consider approximations based on regression. 

Equation (22) shows that L^{6) has a natural interpretation as a regression function 
where the model parameters 6 are the covariates (the independent variables) and K{Ag) is 
the response variable. The expectation can thus also be approximated by solving a regression 
problem. Eurther, in Equation (23) can be seen as the sample average approximation 
of the regression function J{0), 

Jie)=E[Ag], (36) 

where the discrepancy is the response variable. The arguments which we used to show 
that provides a lower bound for carry directly over to J and L^: J provides a lower 
bound for if k is convex or the uniform kernel. 

Proposition 1 establishes a relation between the sample average quantities in Equa¬ 
tion (32) and in Equation (15). In the proof of the proposition in Appendix A, we show 
that the relation extends to the limiting quantities Jg{0) = E [A^] and ig in Equation (13). 
Thus, for Cg = Tig and up to constants and the sign, ig{6) can be seen as a regression 
function with the particular discrepancy Ag as the response variable. 

We next discuss the general strategy to infer the regression functions while avoiding 
unnecessary computations. Eor nonparametric approximations to the likelihood, inferring 
J is simpler than inferring since the function k and its corresponding bandwidth do 
not need to be chosen. We thus propose to first infer the regression function J of the 
discrepancies and then, in a second step, to leverage the obtained solution to infer L^. Eor 
the parametric approximation to the likelihood, this extra step is not needed since Jg is a 
special instance of the regression function J. 

5.2 Inferring the Regression Function of the Discrepancies 

Inferring J{0) via regression requires training data in the form of tuples (0^*\ A^*^). Since 
we are mostly interested in the region of the parameter space where Ag tends to be small, 
we propose to actively construct the training data such that they are more densely clustered 
around the minimizer of J{0). As J{0) is unknown in the first place, our proposal amounts 
to performing regression and optimization at the same time: Given an initial guess that the 
minimizer is in some bounded subset of the parameter space, we can sample some evidence 
of the relation between 6 and Ag, 

£:W = {(0«,AW),...,(0W,Ai*))}, (37) 
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and use this evidence to obtain an estimate of J via regression. The estimated and 
some measurement of uncertainty about it can then be used to produce a new guess about 
the potential location of the minimizer, from where the process re-starts. In some cases, it 
may be advantageous to include the prior pdf of the parameters in the process. We explore 
this topic in Appendix B. 

The evidence set grows at every iteration and we may stop at t = T. The value of 
T can be chosen based on computational considerations, by checking whether the learned 
model predicts the acquired points reasonably well, or by monitoring the change in the 
minimizer ey of jw as the evidence set grows, 

= argmin^ (38) 

Given our examples so far, it is further reasonable to assume that J is a smooth func¬ 
tion. Even for the Ricker model, the mean objective was smooth although the individual 
realizations were not (Figure 3). The smoothness assumption about J can be used in the 
regression and enables its efficient minimization. 

For the special case where is is the target, several observed values of Ag = Ag may 
be available for any given This is because the covariance matrix may be still 

estimated as a sample average so that multiple simulated summary statistics, and hence 
discrepancies, are available per They can be used as discussed above with the only 
minor modification that the training data are updated with several tuples at a time. But it 
is also possible to only use the average value of the observed discrepancies, which amounts to 
using the observed values of i^ for training. The estimated regression function provides 
an estimate for ig in either case. We denote the estimate by t'P and the corresponding 
estimate of Lg by . 

Combining nonlinear regression with the acquisition of new evidence in order to optimize 
a black-box function is known as Bayesian optimization (see, for example, Brochu et ah, 
2010). We can thus leverage results from Bayesian optimization to implement the proposed 
approach, which we will do in Section 6. 

5.3 Inferring the Regression Function for Nonparametric Likelihood Approximation 

The evidence set can be used in two possible ways in the nonparametric setting: The 
first possibility is to compute for each Ag^ in the value = «:(A^*^) and to thereby 
produce a new evidence set which can be used to approximate by fitting a regression 
function. The second possibility is to estimate a probabilistic model of A^ from the evidence 
The estimated model can be used to approximate by replacing the expectation in 
Equation (22) with the expectation under the model. We denote either approximation by 
Z/K ^ where the superscript “ (t) ” indicates that the approximation was obtained via regression 
with t training points. Since is such that the approximation of the regression function 
is accurate where it takes small values, the approximation of will be accurate where it 
takes large values, that is, in the modal areas. 

For nonparametric likelihood approximation, kernels and bandwidths need to be selected 
(see Section 3.3). The choice of the kernel is generally thought to be less critical than 
the choice of the bandwidth (Wand and Jones, 1995). Bandwidth selection has received 
considerable attention in the literature on kernel density estimation (for an introduction. 
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see, for example, Wand and Jones, 1995). The results from that literature are, however, 
not straightforwardly applicable to our work: We may only be given a certain discrepancy 
measure without underlying summary statistics $0 (Gutmann et al., 2014). And even 
if the discrepancy Ag is constructed via summary statistics, the kernel density estimate is 
only evaluated at 4>o which is kept fixed while 6 is varied. Furthermore, we usually only 
have very few observations available for any given 6 which is generally not the case in kernel 
density estimation. These differences warrant further investigations into which extent the 
bandwidth selection methods from the kernel density estimation literature are applicable 
to likelihood-free inference. We focus in this paper on the uniform kernel and generally 
choose h via the quantiles of the Ag , which is common practice in approximate Bayesian 
computation (see, for example, Marin et ah, 2012). The approximate likelihood function 
for the uniform kernel will be denoted by L^u ■ 

5.4 Benefits and Limitations of the Proposed Approach 

The difference between the proposed approach and the standard approach to likelihood- 
free inference of simulator-based statistical models lies in the way the intractable J and 
L are approximated. We use regression with actively acquired training data while the 
standard approach relies on computing sample averages. Our approach allows to incorporate 
a smoothness assumption about J and L in the region of their optima. The smoothness 

assumption allows to “share” observed A^ among multiple 6 which suggests that fewer 

ii) ii) 

Ag , that is, fewer simulated data sets yg , are needed to reach a certain level of accuracy. 
A second benefit of the proposed approach is that it directly targets the region in the 
parameter space where the discrepancy Ag tends to be small, which is very important if 
simulating data sets is time consuming. 

Regression and deciding on the training data are not free of computational cost. While 
the additional expense is often justified by the net savings made, it goes without saying that 
if simulating the model is very cheap, methods for regression and decision making need to be 
used which are not disproportionately costly. Furthermore, prioritizing the low-discrepancy 
areas of the parameter space is often meaningful, but it also implies that the tails of the 
likelihood (posterior) will not be as well approximated as the modal areas. The proposed 
approach thus had to be modified if the computation of small probability events was of 
primary interest. 

Section 4 lists three computational difficulties occurring in the standard approach. Our 
approach addresses the smoothness issues via smooth regression. The inefficient use of 
resources is addressed by focusing on regions in the parameter space where Ag tends to be 
small. The trade-off between computational and statistical performance is still present but 
in modified form: The trade-off is the size of the training set used in the regression. 
The regression functions can be estimated more accurately as the size of the training set 
grows but this also requires more computation. The size of the training set as trade-off 
parameter has the advantage that we are free to choose in which areas of the parameter 
space we would like to approximate the regression function more accurately and in which 
areas an accurate approximation is not needed. This is in contrast to the standard approach 
where a computational cost of N simulated data sets needs to be paid per 6 irrespective of 
its value. 
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6. Implementing the Framework with Bayesian Optimization 

We start with introducing Bayesian optimization and then use it to implement our frame¬ 
work. This is followed by a discussion of possible extensions. 

6.1 Brief Introduction to Bayesian Optimization 

We briefly introduce the elements of Bayesian optimization which are needed in the paper. A 
more thorough introduction can be found in the review articles by Jones (2001) and Brochu 
et al. (2010). While the presented version of Bayesian optimization is rather straightforward 
and textbook-like, our framework can also be implemented with more advanced versions, 
see Section 6.4. 

Bayesian optimization comprises a set of methods to minimize black-box functions f{0). 
With a black-box function, we mean a function which we can evaluate but whose form and 
gradients are unknown. The basic idea in Bayesian optimization is to use a probabilistic 
model of / to select points where the objective is evaluated, and to use the obtained values 
to update the model by Bayes’ theorem. 

The objective / is often modeled as a Gaussian process which is also done in this paper: 
We assume that / is a Gaussian process with prior mean function m{6) and covariance 
function k{0, O') subject to additive Gaussian observation noise with variance ci^. The joint 
distribution of / at any t points ..., 0^*^ is thus assumed Gaussian with mean and 
covariance Kt, 

(/«,...,/W)^~AA(m4,Kt), (39) 





/A:(6»W,6>W) . 



mt = 


, Ki = 



+ 





.. k{Q^'^,0^^)) 



(40) 


We used /O) to denote /(0^*^) and 1^ is the t X t identity matrix. While other choices are 
possible, we assume that m{0) is either a constant or a sum of convex quadratic polyno¬ 
mials in the elements 6j of 0, cross-terms were not included, and that k{0, O') is a squared 
exponential covariance function. 


m{0) = ^ UjO'j -I- bjOj -I- c, 
j 



(41) 


These are standard choices (see, for example, Rasmussen and Williams, 2006, Ghapter 
2). Since we are interested in minimization, we constrain the aj to be non-negative. In 
the last equation, 6j and O'j are the elements of 0 and O', respectively, ajr is the signal 
variance, and the Xj are the characteristic length scales. The length scales control the 
amount of correlation between f{0) and f{0'), in other words, they control the wiggliness 
of the realizations of the Gaussian process. The signal variance is the marginal variance of 
/ at a point 0 if the observation noise was zero. 

The quantities aj, bj, c, aj:, Xj, and are hyperparameters. For the results in this 
paper, they were learned by maximizing the leave-one-out log predictive probability (a form 
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of cross-validation, see Rasmussen and Williams, 2006, Section 5.4.2). The hyperparameters 
were slowly updated as new data were acquired, as done in previous work, for example by 
Wang et al. (2013). This yielded satisfactory results but there are several alternatives, 
including Bayesian methods to learn the hyperparameters (for an overview, see Rasmussen 
and Williams, 2006, Chapter 5), and we did not perform any systematic comparison. 

Given evidence ,..., Z*-*^)}, the posterior pdf of / at a point 6 

is Gaussian with posterior mean Ht{d) and posterior variance vt{0) + 

irnsf + (42) 

where (see, for example, Rasmussen and Williams, 2006, Section 2.7), 

= m{e) + - m^), vt{e) = k{o, e) - kt{e)^K^^kt{0), (43) 

fi = (/(!),..., /W)^, ktie) = {k{e, 0(1)), ..., k{e, 0(‘)))^. (44) 

The posterior mean nt emulates / and can be minimized with powerful gradient-based 
optimization methods. 

The evidence set can be augmented by selecting a new point where / is next 

evaluated. The point is chosen based on the posterior distribution of / given SP ■ While 
other choices are equally possible, we use the acquisition function At{0) to select the next 
point, 

At{0) = - pppPiff), (45) 

where rj^ = 2log[t'^/^+^7r^/(3e,,)] with er^ being a small constant (we used = 0.1). This ac¬ 
quisition function is known as the lower confidence bound selection criterion (Cox and John, 
1992, 1997; Srinivas et al., 2010, 2012).^ Classically, 0(i'’'i) is chosen deterministically as the 
minimizer of At{0). The minimization of At{0) yields a compromise between exploration 
and exploitation: Minimization of the posterior mean /Uj(0) corresponds to exploitation of 
the current belief and ignores its uncertainty. Minimization of —\/vt(0), on the other hand, 
corresponds to exploration where we seek a point where we are uncertain about /. The 
coefficient rjt implements the trade-off between these two desiderata. 

There is usually no restriction that 0(*+^) must be different from previously acquired 
0(*). We found, however, that this may result in a poor exploration of the parameter space 
(see Figure 7 and Example 10 below). Employing a stochastic acquisition rule avoids getting 
stuck at one point. We used the simple heuristic that is sampled from a Gaussian with 

diagonal covariance matrix and mean equal to the minimizer of the acquisition function. 
The standard deviations were determined by finding the end-points of the interval where the 
acquisition function was within a certain (relative) tolerance. Other stochastic acquisition 
rules, like for example Thompson sampling (Thompson, 1933; Chapelle and Li, 2011; Russo 
and Van Roy, 2014), could alternatively be used. 

The algorithm was initialized with an evidence set Sp^ where the parameters ..., 0(*°) 
were chosen as a Sobol quasi-random sequence (see, for example, Niederreiter, 1988). Com¬ 
pared to uniformly distributed (pseudo) random numbers, the Sobol sequence covers the 

3. In the literature, maximization instead of minimization problems are often considered. For maximization problems, 
the acquisition function becomes /it(0) + yjrjlvtid) and needs to be maximized. The formula for 77 ^ is used in the 
review by Brochu et al. (2010) and is part of Theorem 2 of Srinivas et al. (2010). 
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parameter space in a more even fashion. This kind of initialization is, however, not critical 
to our approach, and only few initial points were used in our simulations. 

6.2 Inferring the Regression Function of the Discrepancies 

Letting f{0) = Ag, Bayesian optimization yields immediately an estimate of J{0) in Equa¬ 
tion (36). Since Ag is non-negative, working with / = logA^ seems to be theoretically 
more sound. In practice, however, both approaches were found to work well, albeit we do 
not aim at any systematic comparison here. If / = Ag, the estimate of J is given by 
the posterior mean fit, and if / = log A^, the estimate is given by the mean of a log-normal 
random variable. 




if/(0) = A^, 

exp -h -h cr^)) if/(0) = log Aq. 


(46) 


As discussed in Section 5.2, in the parametric approach to likelihood approximation, 
equals the computable approximation l^s'^ of is- 

We illustrate the basic principles of Bayesian optimization in Example 9 below. In 
Example 10, we illustrate log-Gaussian modeling and the stochastic acquisition rule. 

Example 9 (Bayesian optimization to infer the mean of a Gaussian). Eor inference of 
the mean of a univariate Gaussian, the squared difference of the empirical means was used 
as the discrepancy measure A^, as in Example 6. We modeled the discrepancy Ag as a 
Gaussian process with constant prior mean and performed Bayesian optimization with the 
deterministic acquisition rule. Eigure 6 shows the first iterations: When only a single obser¬ 
vation of Ag is available, t = 1 and A^ is believed to be constant but there is considerable 
uncertainty about it (upper-left panel). The posterior distribution of the Gaussian process 
yields the acquisition function Ai[6) according to Equation (45) (curve in magenta). Its 
minimization gives the value 0*-^^ where Ag is evaluated next (blue rectangle). After includ¬ 
ing the observed value of A^ into the evidence set, t = 2 and the posterior distribution of 
the Gaussian process is re-calculated using Equation (42), that is, the belief about Ag is 
updated using Bayes’ theorem (upper-right panel). The updated belief becomes the cur¬ 
rent belief and the process restarts. A movie showing the process over several iterations is 
available at http://www.cs.helslnki.fi/u/gutmann/material/BOLFI/movies/Gauss.avi.A 

Example 10 (Bayesian optimization to infer the growth rate in the Ricker model). Exam¬ 
ple 5 introduced the synthetic likelihood for the Ricker model. We have seen that individual 
realizations of i^ are rather noisy, in particular for N = 50, but that their average, which 
represents an estimate of ig, is smooth with its optimum in the right region (Eigure 3). We 
here obtain estimates i^'^ of is with Bayesian optimization. The maximal training data are 
T = 150 tuples (logr, £^), where the first nine are from the initialization. The log synthetic 
likelihood was computed using code of Wood (2010) which only returned i^ and not the 
multiple discrepancies prior to averaging. 

Eigures 7(a) and (b) show —after initialization without and with log-transformation, 
respectively (black solid lines). In both cases, we used a quadratic prior mean function. The 
estimated limiting negative log synthetic likelihood —is from Eigure 3 is shown in red for 
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Start: 1 data point 2 data points 



3 data points 



Exploration exploitation 



Acquisition 
of new data 


Figure 6: The first iterations of Bayesian optimization to estimate the mean of a Gaussian. The 
distribution of and its regression function J{9) are reproduced from Figure 4 for ref¬ 
erence (labeled “Target”). Bayesian optimization consists in acquiring new data based 
on the current belief, followed by an update of the belief by Bayes’ theorem. The ac¬ 
quisition of new data is based on an acquisition function which implements a trade-off 
between exploration and exploitation. Exploitation after two data points would consist in 
evaluating the objective again at 0 = 5. Exploration would consist in evaluating it where 
the posterior variance is large, that is, somewhere between minus five and zero. The point 
selected (blue rectangle) strikes a compromise between the two extremes. 


reference. Figure 7(c) shows that the deterministic decision rule can lead to acquisitions 
with very little spatial exploration. The reason for the poor exploration is presumably the 
rather large variance of for N = 50. Working with a log-Gaussian process leads to 
a better exploration and also to a better approximation (Figure 7(d)). The acquisitions 
happen, however, still in a cluster-like manner, which can also be seen in Figure 14 in 
Appendix C where we provide a more detailed analysis. Working with a stochastic decision 
rule leads to acquired points which are spread out more evenly in the area of interest. This 
results in both more stable and more accurate approximations (Figures 7(e-f) and Figure 
15 in Appendix C). ▲ 


22 


















neg log synthetic likelihood neg log synthetic likelihood 


Bayesian Optimization for Likelihood-Free Inference 




(a) Gaussian process model, initial 


(b) log-Gaussian process model, initial 






Figure 7: Approximation of the limiting negative log synthetic likelihood —Is for the Ricker ex¬ 
ample. The approximations are shown as black solid curves. The black dashed curves 
indicate the variability of and the red curves show —Is from Figure 3 for refer¬ 
ence. (a-b) The approximation after initialization with 9 data points. The green vertical 
lines indicate the minimizer of the acquisition function. The dashed vertical lines show 
the mean plus-or-minus one standard deviation in the stochastic decision rule, (c-f) The 
approximations are based on 150 data points (blue circles). 
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6.3 Model-Based Nonparametric Likelihood Approximation 

Bayesian optimization yields a probabilistic model for the discrepancy Ag. As discussed in 
Section 5.3, we can use this model to obtain the computable likelihood approximation Lu\ 



i p' ( h—gtjO) \ 

J 

I ^ / log h-gt{0) \ 

[ ) 


if/(0) = Ae, 
if/(0) = logAe, 


(47) 


where h is the bandwidth (threshold). The function F{x) was defined in Equation (28) and 
denotes the cdf of a standard normal random variable, and nt and vt + are the posterior 
mean and variance of the Gaussian process. 

Both in the nonparametric approach and = exp{i^^) in the parametric approach 
are computable approximations L of the likelihood L. Evaluating them is cheap since no 
further runs of the simulator are needed. Derivatives can also be computed since the 
derivatives of the posterior mean and variance are tractable for Gaussian processes. A 
given approximate likelihood function can thus be used in various ways for inference: We 
can maximize it and compute its curvature (Hessian matrix) to obtain error bars, we can 
perform inference with a hybrid Monte Garlo algorithm in a MGMG framework, or use it 
according to Equation (11) in an importance sampling approach. 

Eor the results in this paper, we used iterative importance sampling where in each 
iteration, the auxiliary pdf q was a mixture of Gaussians as in Section 3.5. The initial 
auxiliary pdf was defined as a mixture of Gaussians in the same manner by associating 
uniform weights with the 0^*^ acquired in the Bayesian optimization step. Samples from the 
prior pdf pg are not needed in such an approach, which can be advantageous if obtaining 
them is expensive. 

We next illustrate model-based likelihood approximation using the example about bac¬ 
terial infections in day care centers. 

Example 11 (Model-based approximate univariate likelihoods for the day care centers). We 
inferred the likelihood function via Bayesian optimization using a Gaussian process model 
with quadratic prior mean and T = 50 data points (10 initial points and 40 acquisitions). 
The bandwidths and general setup were as in Example 7. The left column of Eigure 8 shows 
the estimated models of the discrepancies for the different parameters and compares them 
with the empirical distributions reported in Eigure 5. The right column of Eigure 8 shows 
the estimated likelihood functions Lu\ t = 50 (blue solid curves), and compares them with 
the sample average based approximations from Eigure 5 (black, dots). Eor Bayesian 
optimization, the computational cost for an entire likelihood curve was 50 simulations. This 
is in stark contrast to the computational cost oi N = 300 simulations for a single evaluation 
of in the sample-based approach. Since was evaluated on a grid of 50 points, the 
model-based results required 300 times fewer simulations. The computational savings were 
achieved through the use of smooth regression and the active construction of the training 
data in Bayesian optimization. ▲ 
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(a) Internal infection parameter /3 




(b) External infection parameter A 




(c) Co-infection parameter 6 

Figure 8: Distributions of the discrepancies and approximate likelihoods for the day care center 
example. For reference, the sample average results are reproduced from Figure 5. In 
the standard sample average approach, each likelihood curve required 15,000 simulations 
(right column, black lines with markers). In the proposed model-based approach, each 
likelihood curve required 50 simulations (right column, blue solid lines). This yields a 
factor of 300 in computational savings. 
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6.4 Possible Extensions 

We use in this paper a basic version of Bayesian optimization to do likelihood-free inference. 
But more advanced versions exist which opens up a range of possible extensions. 

6.4.1 Scalability with the Number of Acquisitions 

The straightforward approach of Section 6.1 to Bayesian optimization with a Gaussian 
process model requires the inversion of the t x t matrix K^. The inversion has complexity 
0{t^) which limits the number of acquisitions to a few thousands. For the applications in 
this paper, this has not been an issue but we would like to be able to acquire more than a 
few thousand points if necessary. 

Research on Gaussian processes has produced numerous methods to deal with the inver¬ 
sion of Kt (for an overview, see Rasmussen and Williams, 2006, Ghapter 8). Importantly, 
we can directly use any of these methods for the purpose of likelihood-free inference. For ex¬ 
ample, sparse Gaussian process regression employs m <t “inducing variables” to reduce the 
complexity from 0{t^) to 0{tm?) (see, for example, Quihonero-Gandela and Rasmussen, 
2005). The inducing variables and the hyperparameters of the Gaussian process can be 
optimized using variational learning (Titsias, 2009), which is also amenable to stochastic 
optimization to further reduce the computational cost (Hensman et ah, 2013). 

An alternative approach to Gaussian process regression is Bayesian linear regression 
with a set of m < t suitably chosen basis functions. The two approaches are closely related 
(see, for example, Rasmussen and Williams, 2006, Ghapter 2), but instead of a t x t matrix, 
a m X m matrix needs to be inverted. This reduces the computational complexity again 
to In order to keep the number of required basis functions small, adaptive basis 

regression with deep neural networks has been employed to perform Bayesian optimization 
(Snoek et ah, 2015). 

6.4.2 High-Dimensional Inference 

Likelihood-free inference is in general very difficult when the dimensionality d of the pa¬ 
rameter space is large. This difficulty manifests itself in our approach in the form of a 
nonlinear regression problem which needs to be solved. While we are only interested in 
accurate regression results in the areas of the parameter space where the discrepancy is 
small, discovering these areas becomes more difficult as the dimension increases. 

In general, more training data are needed with increasing dimensions so that a method 
which can handle a large number of acquisitions is likely required (see above). Furthermore, 
the optimization of acquisition functions is also more difficult in higher dimensions. 

Bayesian optimization in high dimensions typically relies on structural assumptions 
about the objective function. In recent work, it was assumed that the objective varies 
along a low dimensional subspace only (Ghen et ah, 2012; Wang et ah, 2013; Djolonga 
et ah, 2013), or that it takes the form of an additive model (Kandasamy et ah, 2015). This 
work and further developments in high-dimensional Bayesian optimization can be leveraged 
for the challenging problem of high-dimensional likelihood-free inference. 
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6.4.3 Parallelization and Acquisition Rules 

Bayesian optimization lends itself to parallelization. In particular the acquisition of new 
data points can be performed in parallel. While several well-known acquisition rules are 
sequential, they can also be parallelized. Our stochastic acquisition rule provides an easy 
mechanism by using a sequential rule to define a probability distribution for the location of 
the next acquisition. Several points can then be drawn in parallel from that distribution. 
We employ the lower confidence bound selection criterion in Equation (45) to drive the 
stochastic acquisitions, but alternative rules, for example the maximization of expected 
improvement, can be used in an analogous way. Other stochastic acquisition rules, like for 
instance Thompson sampling (Thompson, 1933; Chapelle and Li, 2011; Russo and Van Roy, 
2014), enable similarly the concurrent acquisition of multiple data points. 

A more elaborate way to parallelize a sequential acquisition rule is to design the joint 
acquisitions such that the resulting algorithm behaves as if the points are chosen sequentially 
(Azimi et ah, 2010), or to integrate out the possible outcomes of the pending function 
evaluations (Snoek et ah, 2012). Moreover, parallel versions of the lower/upper confidence 
bound criterion have been proposed by Cental et al. (2013) and Desautels et al. (2014). 

In most theoretical studies on acquisition rules, the objective function in Bayesian op¬ 
timization is modeled as a Gaussian process with uncorrelated Gaussian observation noise. 
The distribution of the (log) discrepancy, however, may not follow this assumption. This 
implies on the one hand that the probabilistic modeling of the discrepancy could be im¬ 
proved (see below). On the other hand, it also means that further research would be needed 
about optimal acquisition rules in the context of likelihood-free inference. 

6.4.4 Probabilistic Model 

We modeled the discrepancy as a Gaussian or log Gaussian process using a squared 
exponential covariance function and uncorrelated Gaussian observation noise. While simple 
and often used, we are not limited to these choices. The literature on Gaussian process 
regression and Bayesian optimization provides several alternatives and extensions (for an 
overview, see Rasmussen and Williams, 2006). Modeling of Ag is important because the 
model affects the inferences made. 

In the employed model, a stationary prior distribution is assumed. However, depending 
on the simulator, the discrepancy may behave differently in different parameter regions. 
In particular its variance may be input dependent (heteroscedasticity). Such cases can be 
handled by non-stationary covariance functions or by using different stationary processes in 
different regions of the parameter space (see, for example, Rasmussen and Williams, 2006, 
Ghapters 6 and 9). 

Equation (42) shows that for the Gaussian process model, the posterior variance does 
not depend on the observed function values but only on the acquisition locations. As 
more points are acquired in a neighborhood of a point, the posterior variance may shrink 
even if the observed function values have a larger than expected spread. A dependency 
on the observed values can be obtained indirectly by updating the hyperparameters of the 
covariance function. But a more direct dependency may be preferable. An option is to use 
Student’s t processes instead where the posterior variance depends on the observed function 
values through a global scaling factor (Shah et ah, 2014). 
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7. Applications 

We here apply the developed methodology to infer the complete Ricker model and the 
complete model of bacterial infections in day care centers. As in the previous section, using 
Bayesian optimization in likelihood-free inference (BOLFI) reduces the amount of required 
simulations by several orders of magnitude. 

7.1 Ricker Model 

We introduced the Ricker model in Example 2. It has three parameters: logr, a, and (p. 
The difficulty in the inference stems from the dynamics of the latent time series and the 
unobserved variables. We inferred the parameters using the synthetic likelihood of Wood 
(2010) from the data shown in Figure 1(a) which were generated with Oq = (3.8,0.3,10). 

Wood (2010) inferred the model with a random walk Markov chain Monte Carlo algo¬ 
rithm using (6) with N = 500. The random walk was defined on the log-parameters 
due to their positivity. In a baseline study with the computer code made publicly available 
by Wood (2010), we were not able to infer the parameters with the settings in the orig¬ 
inal publication (Wood, 2010, Section 1.1 in the supplementary material). Reducing the 
proposal standard deviation for ct by a factor of ten enabled inference even though differ¬ 
ent Markov chains still led to rather different marginal posterior pdfs for a. These issues 
were observed for N G {500,1000,5000} and for Markov chains run twice as long as in the 
original publication (100,000 versus 50,000 iterations). In addition to the usual random 
effects in MCMC, the variability in the outcomes of the different chains may be due to his 
approach of working on a single realization of the random log synthetic likelihood function 
(see Figure 3 for example realizations when only logr is varied). The results of our baseline 
study are reported in Appendix D. Given the nature of the baseline results, we should not 
expect that the results from our method match them exactly. 

For BOLFI, we modeled the random log synthetic likelihood as a log-Gaussian pro¬ 
cess with a quadratic prior mean function (using N = 500 as Wood, 2010). Bayesian 
optimization was performed with the stochastic acquisition rule and 20 initial data points. 
Figure 9 shows for t G {50,150, 500}. The results for t = 50 and t = 500 differ more in 
the shape of the estimated regression functions than in the location of the optima. As the 
evidence set grows, the algorithm learns that the log synthetic likelihood is less confined 
along a and that the curvature along the other dimensions should be larger. The plot also 
shows that there is a negative correlation between logr and (p (conditional on a). This 
is reasonable since a larger growth rate r can be compensated with a smaller value of the 
observation scalar (p and vice versa. 

The approximation was used to perform posterior inference of the parameters via 
the iterative importance sampling scheme of Section 6.3 (using three iterations with 25,000 
samples each). This sampling is purely model-based and does not require further runs of the 
simulator. The computed marginal posterior pdfs are shown in Figure 10 (curves in gray) 
together with a MGMG solution for reference (blue dashed). It can be seen that already 
after t = 150 acquired data points, we obtain a solution which matches the MGMG solution 
well at a fraction of the computational cost. About 600 times fewer calls to the simulator 
were needed. 
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Figure 9: Isocontours of the estimated negative log synthetic likelihood function for the Ricker 
model. Each panel shows slices of —with argmax i^s'^ as center point when two of 
the three variables are varied at a time. The center points are marked with a red cross. 
The dots mark the location of the acquired parameters 0*^*^ (projected onto the plane). 
The intensity map is the same in all figures; white corresponds to the smallest value. 


The largest differences between the model-based and the MCMC solution occur for 
parameter a (Figure 10(b)). But we have seen that this is a difficult parameter to infer and 
that the MCMC solution may actually not correspond to ground truth. The two posteriors 
inferred by MCMC have, for instance, posterior means (blue diamonds) which are further 
from the data generating parameter Uq = 0.3 (green circle) than our model-based solution 
(black square). For the other parameters, the posterior means of the model-based solution 
are also closer to ground truth than the posterior means of the MCMC solution. 
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(a) log growth rate log r 



0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 


(b) innovation standard deviation cr 



(c) observation scalar (j) 


Figure 10: Marginal posterior pdfs for the Ricker model. The model-based solutions are shown in 
gray, the blue dashed curves are the MCMC solution. The green circles on the x-axes 
mark the location of Oq = (3.8,0.3,10). The blue diamonds mark the value of the 
posterior mean for the MCMC solution while the black squares indicate the posterior 
means of the model-based solution. For the MCMC solution, 100,000 simulated data 
sets are needed. Bayesian optimization yields informative solutions using 150 simulated 
data sets only, which corresponds to 667 times fewer simulations than with MCMC. 


7.2 Bacterial Infections in Day Care Centers 

The model for bacterial infections in day care centers was described in Example 3. It 
has three parameters of interest: /?, A, and 9. The likelihood function is intractable due 
to the infinitely many unobserved correlated variables. We inferred the model using the 
discrepancy described in Example 7 from the same real data as Numminen et al. (2013). 

Eor BOLEI, we modeled the discrepancy Ag as a Gaussian process with a quadratic 
prior mean function and used the stochastic acquisition rule. The algorithm was initial¬ 
ized with 20 data points. Eigure 11 shows the estimated regression functions for 
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t £ {50,100,150,500}. For t = 50, the optimal co-infection parameter 0 is at a bound¬ 
ary of the parameter space. As more training data are acquired, the shape of the estimated 
regression function changes. The algorithm learns that the optimal 9 is located away from 
the boundary, and the isocontours become oblique which indicates a negative (conditional) 
correlation between all three parameters. A negative correlation between /3 and A given the 
estimate of 6 is reasonable because an increase in transmissions inside the day care centers 
(increase of /3) can be compensated with a decrease of transmissions from an outside source 
(decrease of A). The co-infection parameter 6 is negatively correlated with j3 given the 
estimate of A because a decrease in the tendency to be infected by multiple strains of the 
bacterium (decrease of 9) can be offset by an increase of the transmission rate (increase of 
/3). The same reasoning applies to A given a fixed value of (3. 

We used the Gaussian process model of the discrepancy to compute the model-based 
likelihood Lu^. The threshold h was chosen as the 0.05 quantile of the modeled discrepancy 
at the minimizer of the estimated regression function. Model-based posterior inference was 
then performed via iterative importance sampling as described in Section 6.3 (using three 
iterations with 25,000 samples each). Figure 12 (left column) shows the inferred marginal 
posterior pdfs. They stabilize quickly as the amount of acquired data increases. 

The right column in Figure 12 compares our model-based results with the solution 
by Numminen et al. (2013) (blue horizontal lines with triangles) and with results by the 
population Monte Carlo (PMC) ABC algorithm of Section 3.5 (black curves with diamonds). 
Numminen et al. (2013) used a PMC-ABC algorithm as well but with a slightly different 
discrepancy measure (see Example 7). Both PMC results were obtained using 10,000 initial 
simulations to set the initial threshold, followed by four more iterations with shrinking 
thresholds where in each iteration, data sets were simulated till 10,000 accepted parameters 
were obtained. It can be seen that the posterior mean and the credibility intervals of the 
two PMC results match in the fourth generation, which indicates that our modification of 
the discrepancy measurement had a negligible influence. For the PCM results shown in 
black, iteration one to four required 121,374; 277,997; 572,007; and 1,218,382 simulations 
each, giving a total computational cost of 2,199,760 simulations for the results of iteration 
four. In terms of computing time, the PMC computations took about 4.5 days on a cluster 
with 200 cores. Our model-based results for t = 1,000 were obtained with one tenth of the 
initial simulations of the reference methods and took only about 1.5 hours on a desktop 
computer.^ Out of the computing time, 93% were spent on simulating the day care centers, 
and 7% on regression and optimization of the acquisition function. 

The posterior means of our model-based approach match quickly the reference results 
(red curves with circles versus blue curves with triangles). The focus on the modal re¬ 
gion yields, however, broader credibility intervals. The broader model-based posterior pdfs 
suggest that they could be used as auxiliary pdf for PMC-ABC or other iterative ABC 
algorithms which are based on importance sampling. Moreover, one could evaluate the 
discrepancy at the sampled points to obtain additional training data in order to refine the 
model. 


4. The simulation of the 29 day care centers in the model was partly parallelized hy means of seven cores. 
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(b) 100 data points, 9j = (3.61,0.52,0.11) 




Figure 11: Isocontours of the estimated regression function for the day care center model. 
Visualization is as in Figure 9. 
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Figure 12: Marginal posterior results for the day care center model. The left column shows the 
obtained model-based posterior pdfs. The right column compares the posterior mean 
and the 95% credibility interval with results from PMC-ABC algorithms. We obtain 
conservative estimates of the model parameters at a fraction of the computational cost. 
Posterior means are shown as solid lines with markers, credibility intervals as shaded 
areas or dashed lines. 


33 














































Gutmann and Corander 


8. Conclusions 

Our paper dealt with inferring the parameters of simulator-based (generative) statistical 
models. Inference is difficult for such models because of the intractability of the likelihood 
function. While it is an open question whether variational principles are also applicable, 
the parameters of simulator-based statistical models are typically inferred by finding values 
for which the discrepancy between simulated and observed data tends to be small. We 
have seen that such an approach is computationally costly. The high cost is largely due to 
a lack of knowledge about the functional relation between the model parameters and the 
discrepancies. We proposed to use regression to infer the relation using training data which 
are actively acquired. The acquisition is performed such that the focus in the regression is 
on regions in the parameter space where the discrepancy tends to be small. We implemented 
the proposed strategy using Bayesian optimization where the discrepancy is modeled with a 
Gaussian process. The posterior distribution of the Gaussian process was used to construct a 
model-based approximation of the intractable likelihood. This combination of probabilistic 
modeling and optimization reduced the number of simulated data sets by several orders 
of magnitude in our applications. The reduction in the number of required simulations 
accelerated the inference substantially. 

Our approach is related to the work by Rasmussen (2003) and the two recent papers by 
Wilkinson (2014) and Meeds and Welling (2014) (which became only available after we first 
proposed our approach at “ABG in Rome” in 2013): Rasmussen (2003) used a Gaussian 
process to model the logarithm of the target pdf in a hybrid Monte Garlo algorithm. There 
are two main differences to our work. First, a scenario was considered where the target can 
be evaluated exactly at a finite computational cost, even though the cost might be high. In 
our case, exact evaluation of the likelihood function is not assumed possible at finite cost. 
This difference is important because approximate likelihood evaluations might be rather 
noisy. The second difference is that we used Bayesian optimization to focus on the modal 
areas of the target. 

Related to the approach of Rasmussen (2003), Wilkinson (2014) modeled the log like¬ 
lihood as a Gaussian process. This is different from our work where we model the dis¬ 
crepancies. We believe that modeling the discrepancies is advantageous because it allows 
to delay the selection of the kernel and bandwidth which are needed in the nonparametric 
setting. This is important because it enables one to make use of all simulated data. In the 
parametric setting, the two modeling strategies lead to identical solutions. We found fur¬ 
ther that accurate point estimates can be obtained by modeling the discrepancies only. In 
particular, minimizing their regression function corresponds to maximizing a lower bound 
of the approximate nonparametric likelihood under mild conditions. As a second difference, 
Wilkinson (2014) used space-filling points together with a plausibility criterion to obtain 
the parameter values for the regression. This is in contrast to Bayesian optimization where 
powerful optimization methods are employed to quickly identify the areas of interest. 

Meeds and Welling (2014) proposed an alternative to the sample average approximation 
of the (limiting) synthetic likelihood by modeling each element of the intractable mean 
and covariance matrix of the summary statistics with a Gaussian process. The resulting 
likelihood approximation was used together with a Markov chain Monte Garlo algorithm for 
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posterior inference. The differences to our approach lie in the quantities modeled and in the 
use Bayesian optimization to actively design the training data for the Gaussian processes. 

There are also connections to the body of work on Bayesian analysis of computer codes 
(for an introduction to this field of research, see for example the paper by O’Hagan, 2006): 
Sacks et al. (1989) and Currin et al. (1991) modeled the outputs of general deterministic 
computer codes as Gaussian processes. The computer codes were, for example, solving 
complex partial differential equations, and the papers were about finding an emulator for 
the heavy computations. Inference of unknown parameters of the computer codes given 
observed data was only considered later by Gox et al. (2001) and Kennedy and O’Hagan 
(2001). The observed and simulated data were modeled using Gaussian processes, and 
space-filling points were used to choose the parameters for which the computer code was 
run. The main differences to our approach are again the quantities modeled, and the use 
of Bayesian optimization. 

We employed a rather basic algorithm to perform Bayesian optimization. This does, 
however, not mean that Bayesian optimization for likelihood-free inference is limited to 
that particular algorithm. We discussed a number of alternatives, as well as more advanced 
algorithms which could be used instead, and outlined a general framework for increasing 
the computational efficiency of likelihood-free inference. 

Our paper opens up a wide range of extensions and opportunities for future research. 
One possibility is to use the tools provided by Bayesian optimization to tackle the chal¬ 
lenging problem of likelihood-free inference in high dimensions. More foundational research 
topics would revolve around the modeling of the discrepancies and the development of ac¬ 
quisition rules which are tailored to the problem of likelihood-free inference. We focused on 
approximating the modal areas of the intractable likelihoods more accurately than the tails. 
It is an open question of how to best increase the accuracy in the tail areas. One possibility 
is to use the samples from the approximate posterior to update the training data for the 
regression, which would naturally lead to a recursion where the current method would only 
provide the initial approximation. 
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Appendix A. Proof of Proposition 1 

We split the objective , defined in Equation (32), into two terms, 

jf(0)=Ti(0)+r2(0), (48) 

ri(0) =log|detC 0 |, (49) 

T2{e) = [($o - $0)TCg - «h0)] . (50) 

Term T 2 can be rewritten using the empirical mean fig and the covariance matrix f^g in 
Equation (14), 


T2ie) = E^ [(4>„ - fig + fig- cD0)Tc-1(c^„ - fig + fig- cDg)] (51) 

= E^ [(4>„ - fig)'^Cg\<^o - M + (A. - <^eVCg\fig - «1>0) 
+2i<^o-figVCg\fig-<^g)] (52) 

= (<ho - fig)^Cg\<^o - fie) + tr E^^ [{fig - <^g){fig - $ 0 )^]) (53) 

= (4>o - fLg)^Cg\^o - fie) + tr , (54) 

where we have used that E'^ [4>0] = fig. Eor Cg = we have 

T2{e) = {^o-fie)~^'^e\^o-fie)+P- (55) 

Hence, for Cg = llg, equals 

(0) = log I det %\ + ($o - fig)'^tg\^o - fig) + P- (56) 

On the other hand, the log synthetic likelihood is 


^T(0) = -f log(27r) - ^ log I det - l(cl><, - fig)~^tg\^o - fig) 


so that 


(0) = p -plog(27r) - 2^7(0). (58) 

The claimed result follows now from Equation (31), 

logLf(0)>-|+if(0). (59) 

Replacing the empirical average E'^ with the expectation shows that the limiting quan¬ 
tities (.s arid Jg{d), 

Jg{e) = E [A^] , (60) 

are related by an analogous result. In more detail, 

Jg{e) = log I det Cg\ + E [($o - $e)Tc^i(4.„ - <^g)\ (61) 

= log I det Cg\ + (<I>o - iig)^Cg^{^o - fig) + tr , (62) 
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where we used the same development which led to Equation (54) but with the expectation 
instead of E'^. Hence, for Cg = we have the analogous result by definition of is in 
Equation (13), 

Jg{6) =p-plog(27r) -2£s{6). (63) 

It follows by definition of Jg that ig can be seen as a regression function where 6 is the 
vector of covariates and Ag is, up to constants and the sign, the response variable. 

Appendix B. Using the Prior Distribution of the Parameters in Bayesian 
Optimization 

In the main text, we focused on acquiring training data in regions in the parameter space 
where the discrepancy tends to be small, which corresponds to the modal regions of the 
approximate likelihoods. Eor highly informative priors pe with modal regions far away from 
the peaks of the likelihood such an approach is suboptimal for posterior inference. Since 
the prior is typically fairly broad and the likelihood peaked, this situation is not usual. But 
if it happens, it is better to directly acquire the training data in the modal areas of the 
posterior. Eor inference via the synthetic likelihood, this can be straightforwardly done by 
approximating ig + \ogpe- In Bayesian optimization with Ag as the response variable, the 
posterior mean pt in Equation (43) would then be replaced by /it(0) = Ptifi) — 2\ogpe- 
Eor inference via an nonparametric approximation of the likelihood, the same approach 
may also work but this warrants further investigations because the regression function J 
provides only a lower bound for the likelihood. We also note that using the prior po can be 
helpful if it is known that the parameters do not influence the model independently, causing 
for instance the discrepancy to be nearly constant along certain directions in the parameter 
space. 

Eigure 13 illustrates the basic idea using Example 1 and a prior pdf pe (blue curve) 
which has practically no overlap with the true likelihood L (green curve). The results are 
for Bayesian optimization with 20 deterministic acquisitions and a Gaussian process model 
with constant mean function. 

Appendix C. Bayesian Optimization with a Deterministic versus a Stochastic 
Acquisition Rule 

Example 10 illustrated log-Gaussian modeling and the stochastic acquisition rule by means 
of the Ricker model with the log growth rate logr as only unknown. We here show the 
differences between stochastic and deterministic acquisitions in greater detail. The results 
are for a log-Gaussian process model. 

Eigure 14 shows the estimated regression functions as obtained with a deterministic 
acquisition rule like in Eigure 7(d) for different t. The acquired data points are vertically 
clustered because the acquisition rule often proposed nearly identical parameters. Eigure 
15 shows obtained with a stochastic acquisition rule as in Eigure 7(f). While both 
methods lead to a satisfactory approximation of the negative log synthetic likelihood around 
its minimum, the result with the stochastic acquisition rule seems more stable because the 
acquired training data are spread out more evenly in the interval of interest. 
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(a) Bayesian optimization without using the prior pe 




(b) Bayesian optimization with the prior pe during the acquisitions 


Figure 13: Using the prior density pg in Bayesian optimization, (a) If pQ is not used (or if uniform), 
the focus is on the modal region of the likelihood. If the prior is far from the mode 
of the likelihood, the learned model is less accurate in the modal areas of the posterior 
(black dashed versus red solid curve), (b) The prior pdf pe was used to shift the data 
acquisitions in Bayesian optimization to the modal area of the posterior (see the circles 
in the figures on the right or on the x-axes on the left). This results in a more accurate 
approximation of the posterior pdf but a less accurate approximation of the mode of the 
likelihood (dashed magenta versus blue solid curve). 
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Figure 14: Log-Gaussian process model for the log synthetic likelihood of the Ricker model with 
log r as only unknown. The results are for the deterministic acquisition rule consisting 
of minimization of the acquisition function in Equation (45). Note the vertical clusters. 
The visualization is as in Figure 7. The plot range was restricted to (3.2,4.2) so that not 
all acquisition may be shown. 
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Figure 15: Log-Gaussian process model for the log synthetic likelihood of the Ricker model with 
log r as only unknown. The setup and visualization is as in Figure 14 but the stochas¬ 
tic acquisition rule is used. A movie showing the acquisitions and the updating of 
the model is available at http : / / www. cs . helsinkl. f i/u/gutmann/material/BOLFI/ 
movies/RickerlD.avi. 
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Appendix D. Ricker Model Inferred with a Markov Chain Monte Carlo Algorithm 

We here report the simulation results for the Ricker model inferred with the log synthetic 
likelihood and a random walk MCMC algorithm with the code made publicly available 
by Wood (2010). We ran the algorithm for 100,000 iterations, starting at 6o = (3.8, 0.3,10). 
The first 25,000 samples were discarded. In the work by Wood (2010), the proposal standard 
deviation for a was 0.1. Figure 16 shows that this choice led to a chain which got stuck close 
to (T = 0 even when N = 5,000 (blue, squares). Reducing the proposal standard deviation 
by a factor of 10 allowed us to obtain reasonable results (red, circles). The proposal standard 
deviations for the remaining parameters were the same as in the original publication. We 
then investigated the stability of the inferred posteriors when N is reduced from N = 5,000 
io N = 500 and when the simulator is run with different realizations of the random log 
synthetic likelihood. Figure 17 shows that the posteriors are stable for logr and cj) but that 
there is some variation for a. 





(b) innovation standard deviation cr 


(c) observation scalar (p 


Figure 16: Choice of the transition kernels for inference of the Ricker model via MCMC. We used 
N = 5,000 which is ten times more than in the original work (Wood, 2010). The dashed 
curves with markers are (rescaled) histograms, the solid curves kernel density estimates. 
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(a) log growth rate log r, varying N 
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(c) innovation standard deviation a , varying N 





(e) observation scalar cj ), varying N 
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(d) innovation standard deviation a , varying seed 



(f) observation scalar cj >, varying seed 


Figure 17: Effect of the number of simulated data sets N (left column) and the seed of the random 
number generator (right column) for the Ricker example when inferred with the method 
by Wood (2010). Visualization is as in Figure 16. 
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